ABSTRACT

Structural learning as proposed by Ishikawa [1] has had
success in reducing unimportant weights in an initially
oversized multilayer feedforward neural network during
training. This technique employs a forgetting constant
ε which determines the rate of weight decay during
learning. A formalization of this method is obtained
by considering the weights as the states of a dynamic
system. Analysis of the system eigenvalues and simulation
indicate that ε controls removal of fixed points
corresponding to the weights of oversized networks. Thus
considering error backpropagation as a dynamic system
has enabled analysis of structural learning.

1. INTRODUCTION 

Structural learning of feedforward neural networks of- 
fers effective pruning with a relatively simple learning 
rule [1]. This learning rule includes a term which forces 
weights to decay during training. Near minimal net- 
works can be obtained by applying this rule to ini- 
tially oversized networks. Many of the weights are typ- 
ically small at the conclusion of training, and thus both 
weights and neurons may be subsequently pruned.
Although the structural learning rule with its forgetting
constant is well understood intuitively, the choice of a
specific value of the constant is somewhat arbitrary [2].
In order to analyze structural learning, the learning
must be expressed more formally. Based on a
given training set the error backpropagation (EBP) al- 
gorithm moves the weights from some random initial 
condition towards the final value corresponding to the 
closest local minimum of an error function. The fact 
that during learning the weights evolve in time while 
the training set remains invariant suggests that the net- 
work weights can be considered as states of a dynamic 
system [3]. 

The objective of this paper is to characterize solu- 
tions which correspond to fixed points in the weight 
space in terms of their stability with respect to the
forgetting constant, which is considered to be a parameter
of the learning dynamics. It will be shown that this pa- 
rameter controls the eigenvalues of the fixed points [4] 
and hence their convergence properties. Thus the for- 
getting constant determines the number of stable fixed 
points in the weight space. The best pruning occurs 
when all the solutions corresponding to oversized net- 
works are removed from the weight space by training 
with an appropriate forgetting constant. These solu- 
tions become unreachable from any initial condition. 

2. EBP LEARNING AS A DYNAMIC 
SYSTEM 

Assume that w is a vector composed of the neural network
weights. Considering backpropagation learning as
a dynamic system, the weight vector w is understood to
be the state of this system. The standard EBP learning
rule [5] is a discrete-time mapping from w_n to w_{n+1}
given by

$$w_{n+1} = w_n - \eta \left.\frac{dE(w)}{dw}\right|_{w=w_n} \qquad (1)$$

where η is the learning rate and E(w) is the error function
for a given training set. More specifically, E(w) is a
functional defined at each state w which can be interpreted
as the state potential in the weight space. Starting from some
initial condition w_0, the dynamic rule (1) determines the
cascade trajectory {w_n} which leads to the nearest local
minimum of the functional E. The minimum is located
at a fixed point w_F which is an attractor of the
trajectory {w_n}.

As η → 0 the discrete-time dynamic system with
rule (1) becomes an approximation of a continuous-time
dynamic system described by the differential equation [6]

$$\frac{dw}{dt} = F(w), \qquad w(0) = w_0, \qquad (2)$$

where the vector function is defined as F(w) =
-dE(w)/dw. Thus the derivative of the state is proportional
to the gradient of the potential function
E(w) with a negative proportionality coefficient. The
trajectory w(t) is a continuous manifold in the weight
space, leading from the initial condition w_0, passing
through points {w_n}, and eventually ending at an
attractor w_F.
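The relation between rule (1) and equation (2) can be illustrated with a minimal sketch. It assumes a hypothetical diagonal quadratic potential (the names `A_DIAG`, `W0`, `ebp_discrete`, and `flow_exact` are illustrative and not taken from the paper), for which the continuous-time trajectory is known in closed form. The discrete iterates should approach w(t) as η shrinks.

```python
import numpy as np

# Hypothetical toy potential E(w) = 0.5 * sum(a_i * w_i**2); not from the paper.
A_DIAG = np.array([1.0, 3.0])
W0 = np.array([1.0, -2.0])

def grad_E(w):
    """Gradient dE/dw of the toy potential."""
    return A_DIAG * w

def ebp_discrete(w0, eta, t_end):
    """Iterate rule (1), w_{n+1} = w_n - eta * dE/dw, for n = t_end / eta steps."""
    w = w0.copy()
    for _ in range(int(round(t_end / eta))):
        w = w - eta * grad_E(w)
    return w

def flow_exact(w0, t_end):
    """Closed-form solution of dw/dt = -dE/dw for the diagonal quadratic."""
    return w0 * np.exp(-A_DIAG * t_end)

# Distance between the discrete iterate and the continuous trajectory at t = 1.
errors = {
    eta: float(np.linalg.norm(ebp_discrete(W0, eta, 1.0) - flow_exact(W0, 1.0)))
    for eta in (0.1, 0.01, 0.001)
}
for eta, err in errors.items():
    print(f"eta={eta:<6} |w_n - w(t)| = {err:.2e}")
```

The gap between the two trajectories decreases with η, which is the sense in which the discrete rule approximates the continuous system.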

Ishikawa suggested [1] a modification of equation (1) 
to force the weights to decay during learning. Express- 
ing Ishikawa's structural learning with a continuous- 
time dynamic rule as proposed yields 



$$\frac{dw^s_{ij}}{dt} = -\frac{\partial E}{\partial w^s_{ij}} - \varepsilon\,\operatorname{sign}(w^s_{ij}). \qquad (3)$$



Here w^s_ij is an entry in weight vector w corresponding
to the weight connecting the j-th neuron from the layer
preceding the s-th layer (or input) with the i-th neuron
from the s-th layer. Intuitively, the first term in
rule (3) moves state w to minimize the original EBP
functional E while the second term forces weight decay.
As a result of these two conflicting criteria both
learning and weight decay occur: some of the weights
are suppressed to zero while the remaining weights are
responsible for the error minimization. The parameter
ε controls the level of pruning.
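The competition between the two terms of rule (3) can be sketched on a deliberately tiny problem. The error function below is a hypothetical stand-in (two weights, one of them redundant), not the network studied in this paper, and the names `C`, `train`, `w_plain`, and `w_decay` are illustrative. Rule (3) is integrated with a simple Euler step.

```python
import numpy as np

# Hypothetical toy error E(w) = 0.5 * (w1 + 0.5*w2 - 1)**2.
# Either weight can reduce the error, so one of them is redundant.
C = np.array([1.0, 0.5])

def grad_E(w):
    """Gradient of the toy error with respect to the weights."""
    return (C @ w - 1.0) * C

def train(w0, eps, eta=0.01, n_steps=10_000):
    """Euler integration of rule (3): dw/dt = -dE/dw - eps * sign(w)."""
    w = np.array(w0, dtype=float)
    for _ in range(n_steps):
        w = w + eta * (-grad_E(w) - eps * np.sign(w))
    return w

w_plain = train([0.3, 0.8], eps=0.0)   # standard gradient descent
w_decay = train([0.3, 0.8], eps=0.1)   # with the forgetting term
print("eps = 0.0:", w_plain)
print("eps = 0.1:", w_decay)
```

In this sketch the run without forgetting keeps both weights clearly non-zero, while the run with ε = 0.1 drives the less efficient weight to a small chatter around zero and leaves the other weight to carry the mapping, at the cost of a residual error of about ε.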

More formally, the pruning effect can be explained in
terms of the stability of the fixed points created in the
weight space by equation (3). The new locations of the
fixed points are determined by the modified potential
function E_ε(w) = -∫ F_ε(w) dw. To enable integration,
the signum function on the right side of equation (3)
can be approximated as sign(w_ij) ≈ tanh(β w_ij) with
β sufficiently large. Thus the smoothed dynamic rule
reads

$$\frac{dw(t)}{dt} = F_\varepsilon(w) = -\frac{dE_\varepsilon(w)}{dw} = -\frac{dE(w)}{dw} - \varepsilon \tanh(\beta w). \qquad (4)$$
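As a worked step, the modified potential can be written out explicitly. The expression below assumes the sign convention F = -dE/dw of equation (2), a componentwise tanh, and drops the additive constant of integration; the index i runs over all weights.

```latex
E_\varepsilon(w) = -\int F_\varepsilon(w)\,dw
  = E(w) + \frac{\varepsilon}{\beta} \sum_{i} \ln\cosh(\beta w_i),
\qquad
-\frac{\partial E_\varepsilon}{\partial w_i}
  = -\frac{\partial E}{\partial w_i} - \varepsilon \tanh(\beta w_i).
```

Since (1/β) ln cosh(β w_i) approaches |w_i| as β grows, the added term tends to ε Σ|w_i|, which recovers the signum form of rule (3).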

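The location of fixed points in the weight space can be
determined by setting the time derivative of state w
in (4) to zero. Thus the fixed point equation is

$$F_\varepsilon(w_F) = -\left.\frac{dE(w)}{dw}\right|_{w=w_F} - \varepsilon \tanh(\beta w_F) = 0. \qquad (5)$$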



In order to determine the stability of the fixed
point w_F, F_ε can be linearized in the neighborhood
w_F + Δw, yielding

$$\frac{d}{dt}\Delta w = \left.\frac{\partial F_\varepsilon(w)}{\partial w}\right|_{w=w_F} \Delta w. \qquad (6)$$



The stability of the fixed point w_F depends upon
the convergence properties of its neighborhood. The
fixed point is an attractor if all the eigenvalues of
the Jacobian J_ε(w_F) = dF_ε(w)/dw evaluated at w =
w_F are non-positive real numbers. Let λ_max be the
maximum eigenvalue of the Jacobian J_ε(w_F). As
long as λ_max is non-positive, the trajectory Δw(t) =
exp(J_ε(w_F) t) Δw_0, starting from an initial state Δw_0,
converges to the fixed point w_F. If λ_max is positive,
the trajectory departs from w_F in the direction of the
eigenvector corresponding to that eigenvalue. Thus,
only fixed points with non-positive λ_max should be
considered as attractors in the weight space of the neural
network during learning. Unstable fixed points are not
reachable from any initial condition w_0 except for those
located exactly at separatrices of w_F.

We may therefore group the system eigenvalues at 
the fixed point into two classes, namely those which are
negative and those which are zero. Negative eigenval- 
ues are responsible for the attracting directions in the 
weight space. Zero eigenvalues indicate directions in 
which movement of the weights does not affect the po- 
tential functional. Zero-valued eigenvalues effectively 
reduce the order of the system. 
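The two eigenvalue classes can be seen in a minimal sketch that evaluates the Jacobian of the smoothed rule (4) on a hypothetical two-weight error function with one redundant weight. The error function, the value β = 10, and the names `jacobian` and `classify` are assumptions made for illustration; the second evaluation point is only an approximate fixed point, obtained by taking tanh(β w_1) ≈ 1.

```python
import numpy as np

C = np.array([1.0, 0.5])     # hypothetical toy error E(w) = 0.5 * (C.w - 1)**2
HESSIAN = np.outer(C, C)     # d^2E/dw^2, constant for this quadratic

def jacobian(w, eps, beta):
    """J_eps(w) = dF_eps/dw = -d^2E/dw^2 - eps * beta * diag(sech^2(beta * w))."""
    sech2 = 1.0 / np.cosh(beta * np.asarray(w, dtype=float)) ** 2
    return -HESSIAN - eps * beta * np.diag(sech2)

def classify(w, eps, beta, tol=1e-6):
    """Count negative, zero, and positive eigenvalues of the Jacobian at w."""
    eig = np.linalg.eigvalsh(jacobian(w, eps, beta))   # J is symmetric here
    return {
        "negative": int(np.sum(eig < -tol)),
        "zero": int(np.sum(np.abs(eig) <= tol)),
        "positive": int(np.sum(eig > tol)),
    }

beta = 10.0
# eps = 0: any point with C.w = 1 is a fixed point of plain gradient flow.
counts_plain = classify([0.54, 0.92], eps=0.0, beta=beta)
# eps = 0.1: approximate fixed point with the redundant weight near zero.
w2 = np.arctanh(0.5) / beta
counts_decay = classify([0.9 - 0.5 * w2, w2], eps=0.1, beta=beta)
print(counts_plain, counts_decay)
```

Without forgetting, the toy system has one attracting direction and one zero eigenvalue along the line of equivalent solutions; with ε > 0 both eigenvalues are negative in this sketch.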

Note that system (4) is parameterized with a forgetting
rate ε [4]. For ε = 0, the dynamic system represents
standard EBP, and thus there may be many
stable fixed points, representing the many possible
solutions of the learning process. For a large value of
ε, the second term in equation (4) dominates the first
term, and the system simplifies to a set of first order
independent differential equations. Furthermore, this
system is linear except in a small region around the
origin of the weight space. The only fixed point for the
case of large ε is the origin of the weight space. Here
the system Jacobian J_ε(w_F) has non-zero entries only
along the diagonal, and these entries are proportional
to ε.

Thus ε affects the stability of fixed points. Specifically,
if λ_max of a fixed point is negative but close
to zero, changing ε may cause this eigenvalue to become
positive. This makes the fixed point unstable
and eliminates it from the weight space as a solution
of the learning process. Hence, learning with various
forgetting rates ε results in a different density of stable
fixed points w_F. The question arises as to how the
forgetting rate is related to the density of these fixed
points and why certain values of ε are recommended for
the best pruning results [2]. The sensitivity of λ_max to
changes of ε at the fixed points is crucial for answering
this question.

Consider that equation (4) is a dynamic rule of learning
with parameter ε. Therefore a given fixed point is
a function of this parameter: w_F = w_F(ε). Taking into
account the fixed point equation (5), the movement of
the fixed point w_F with respect to ε can be approximated as

$$\frac{dw_F}{d\varepsilon} = -\left(\frac{\partial F_\varepsilon}{\partial w_F}\right)^{-1} \frac{\partial F_\varepsilon}{\partial \varepsilon}.$$

For a small change Δε the fixed point moves from w_F
to the new location w_F*. As a result of this movement
the new fixed point has new convergence properties.
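The movement of a fixed point with ε can be checked numerically. The sketch below assumes a hypothetical two-weight error function and β = 10, solves the fixed point condition F_ε(w) = 0 with a Newton iteration (a convenience chosen here, not the procedure of this paper), and compares the implicit-function estimate of dw_F/dε with a finite difference. All function names are illustrative.

```python
import numpy as np

C = np.array([1.0, 0.5])   # hypothetical toy error E(w) = 0.5 * (C.w - 1)**2
BETA = 10.0                # assumed steepness of the tanh approximation

def F(w, eps):
    """Smoothed rule: F_eps(w) = -dE/dw - eps * tanh(beta * w)."""
    return -(C @ w - 1.0) * C - eps * np.tanh(BETA * w)

def jacobian(w, eps):
    """dF_eps/dw for the toy error."""
    return -np.outer(C, C) - eps * BETA * np.diag(1.0 / np.cosh(BETA * w) ** 2)

def fixed_point(w_init, eps, n_iter=50):
    """Newton iteration on F_eps(w) = 0, started near the fixed point."""
    w = np.array(w_init, dtype=float)
    for _ in range(n_iter):
        w = w - np.linalg.solve(jacobian(w, eps), F(w, eps))
    return w

eps, d_eps = 0.1, 1e-4
w_f = fixed_point([0.9, 0.05], eps)

# Implicit-function estimate: dw_F/d_eps = -(dF/dw)^-1 dF/d_eps,
# where dF/d_eps = -tanh(beta * w) for the smoothed rule.
predicted = -np.linalg.solve(jacobian(w_f, eps), -np.tanh(BETA * w_f))

# Finite-difference check: re-solve for the fixed point at eps + d_eps.
measured = (fixed_point(w_f, eps + d_eps) - w_f) / d_eps
print("predicted:", predicted)
print("measured: ", measured)
```

For this toy problem the active weight shrinks at unit rate as ε grows while the pruned weight stays where it is, and the two estimates agree closely.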

3. NUMERICAL RESULTS 

In order to demonstrate the relation between the eigenvalues
of fixed points in the weight space and the parameter
ε, system (4) was simulated for the case of
block training a 3-5-1 multilayer feedforward network
for the XOR problem. The desired outputs for the
network were ±0.9 to allow the neurons to reach
the desired output states. The system was initialized to
a random initial condition w_0 and allowed to evolve in
time to a fixed point. Once at the fixed point, ε was
perturbed, and the system was again brought into
equilibrium. For each new position of the fixed point, the
system eigenvalues were evaluated. Figure 1 shows the
system eigenvalues for a fixed point as it moves with
changes of ε. Note that at ε = 0 both non-zero and zero
eigenvalues exist, but with an increase in ε the system
becomes full order.
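The procedure described above (equilibrate, perturb ε, re-equilibrate, evaluate eigenvalues) can be sketched on a small stand-in problem. The code below does not reproduce the 3-5-1 XOR network; it assumes a hypothetical two-weight error function, β = 10, and simple Euler integration, and the names `equilibrate` and `history` are illustrative. Being convex, this toy problem does not exhibit the loss of stability and jump to the origin discussed for Figure 2.

```python
import numpy as np

C = np.array([1.0, 0.5])   # hypothetical toy error E(w) = 0.5 * (C.w - 1)**2
BETA = 10.0                # assumed steepness of the tanh approximation

def F(w, eps):
    """Smoothed rule (4) for the toy error."""
    return -(C @ w - 1.0) * C - eps * np.tanh(BETA * w)

def jacobian(w, eps):
    """dF_eps/dw for the toy error."""
    return -np.outer(C, C) - eps * BETA * np.diag(1.0 / np.cosh(BETA * w) ** 2)

def equilibrate(w, eps, eta=0.05, tol=1e-10, max_steps=200_000):
    """Euler-integrate dw/dt = F_eps(w) until the state stops moving."""
    w = np.array(w, dtype=float)
    for _ in range(max_steps):
        step = F(w, eps)
        if np.linalg.norm(step) < tol:
            break
        w = w + eta * step
    return w

w = np.array([0.3, 0.8])             # arbitrary initial condition
history = []
for eps in (0.0, 0.1, 0.2, 0.3):
    w = equilibrate(w, eps)          # restart from the previous fixed point
    lam_max = float(np.linalg.eigvalsh(jacobian(w, eps)).max())
    history.append((eps, w.copy(), lam_max))
    print(f"eps={eps:.1f}  w_F={np.round(w, 4)}  lambda_max={lam_max:+.4f}")
```

In this sketch the largest eigenvalue is zero at ε = 0 and becomes increasingly negative as ε grows, so the toy system also goes from reduced order to full order once forgetting is switched on.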

Figure 2 shows the maximum system eigenvalue as
ε is varied. Note that the fixed point becomes more
stable up to a critical value of ε, where the stability of
the fixed point starts to decrease. A further increase
of ε causes this fixed point to become unstable, forcing
the system to another fixed point, which in this case
is the origin. This state transition is indicated by the
discontinuous jump to the second distinct part of the
curve, which corresponds to the convergence properties
of the origin.

Figure 3 shows the error of the learning process for
the same fixed points. Note that for ε = 0, the error
is essentially zero, in contrast to typical solutions from
EBP, since EBP utilizes the first order Euler method,
which is only asymptotically convergent to fixed points.
For non-zero values of ε, this dynamic approach resulted
in weights which were either strongly non-zero
or practically identical to 0, as opposed to the situation
for EBP structural learning, for which some of
the weights may be close to zero, but far enough from
0 to cause hesitation about their removal during pruning.
Again, since we are considering fixed points, solutions
may no longer decay, and thus weights unimportant to
the system mapping must be definitively zero. Note
that this graph also reflects the increase of error inherent
in the balancing of the contradictory goals of error
and complexity minimization.



In Figure 4 the eigenvalues of the fixed point at the
origin are shown. With ε = 0 the origin is not stable
as most of the eigenvalues are 0. The one non-zero
eigenvalue corresponds to the weight connected to the
hidden layer bias neuron. This plot demonstrates that
the system eigenvalues at the origin are proportional
to ε.

4. CONCLUSIONS 

The ability of structural learning to reduce the magnitude
of weights of a multilayer feedforward neural network
has been analyzed by considering the forgetting
rate ε as a parameter of a dynamic system. The parameter
ε controls the convergence properties of fixed points in the
weight space which correspond to solutions of the learning
process. Furthermore, results indicate that the dynamic
system approach offers the ability to determine
the final result of error backpropagation training, and
thus this formalization will hopefully enable analysis
of not only structural learning but also other training
methodologies.

Figure 1. System eigenvalues for a fixed point as ε is varied.

Figure 2. Maximum system eigenvalue for two fixed points as ε is varied.

Figure 3. Network output error for two fixed points as ε is varied.

Figure 4. Eigenvalues at the origin of the weight space as ε is varied.